V0 version of embedding ingestion core. by shixiao-coder · Pull Request #1964 · datacommonsorg/data

shixiao-coder · 2026-04-17T16:45:05Z

It includes:
- requirements.txt to install the related packages
- a placeholder main.py, which will trigger the various embedding ingestion logic
- embedding_utils.py, with functions to:
  - read the latest timestamp for the lock, returning None if no timestamp is set
  - find the Node IDs to update, filtered by timestamp, node type, and validity (non-empty)
  - convert a Node and its fields to an ID and embedding_content
  - generate embeddings from the ID and embedding content in batches
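The last two steps in the description can be sketched roughly as follows. This is only an illustration, not the code in the PR: `to_embedding_content` and `batched` are hypothetical helper names, and the sketch assumes each node arrives as a dictionary with `subject_id` and `name` keys and that the name alone is used as the embedding content.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Optional, Tuple


def to_embedding_content(node: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Map a node row to (id, embedding_content).

    Returns None when the id or name is missing or empty, so that
    invalid rows can be filtered out before any embedding call.
    Assumption: the name is the only field used as embedding content.
    """
    subject_id = node.get("subject_id")
    name = (node.get("name") or "").strip()
    if not subject_id or not name:
        return None
    return subject_id, name


def batched(items: Iterable, size: int) -> Iterator[List]:
    """Yield consecutive lists of at most `size` items from `items`."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk
```

Because `batched` consumes its input lazily, it also works on a generator of rows, which keeps memory use bounded when the node set is large.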

gemini-code-assist

Code Review

This pull request introduces a Cloud Function and helper utilities for automating node embedding workflows using Google Cloud Spanner. Key features include fetching updated nodes based on ingestion locks and processing embeddings in batches via ML.PREDICT. Feedback identifies critical issues such as a missing time import and the use of unsupported INSERT OR UPDATE syntax in Spanner SQL. Other improvements include correcting a typo in the initialization action string, removing duplicate imports and redundant initializations, and optimizing result set processing.

…st on a request of 250. If the batch is smaller than 250 or not divisible by 250, it actually sends more requests to the embedding model, each containing far fewer items. Changing the batch size from 100 to 500 reduces QPM usage from 1000 to 700. A timeout is set because each request now contains 250 items and runs longer.
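The commit message is truncated, so the exact splitting rule is not visible here. As a rough sketch of the reasoning, assume each batch is split into requests of at most 250 items. `requests_needed` is a hypothetical helper, and its `request_cap` default is simply the 250 figure from the message.

```python
import math


def requests_needed(total_rows: int, batch_size: int, request_cap: int = 250) -> int:
    """Count model requests when every batch is split into chunks of at most request_cap.

    Assumption: a batch of n rows costs ceil(n / request_cap) requests,
    so batch sizes that are not multiples of request_cap leave
    partially filled requests behind.
    """
    full_batches, remainder = divmod(total_rows, batch_size)
    per_full_batch = math.ceil(batch_size / request_cap)
    tail = math.ceil(remainder / request_cap) if remainder else 0
    return full_batches * per_full_batch + tail
```

Under this assumption, 1000 rows need 10 requests at a batch size of 100, 7 at 300, and 4 at 500. The sketch only illustrates why multiples of 250 waste fewer requests; it does not attempt to reproduce the QPM figures quoted in the message.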

gmechali

Thanks Xiao, this looks great. I left a couple of minor comments, plus one on the query and maybe how to optimize it.

My main feedback is that we need tests to go in this PR too! Can you add a tests/ directory and add unit tests for embedding_utils and e2e tests for main.py?

gmechali · 2026-04-22T13:43:11Z

+    Yields:
+        Dictionaries containing subject_id and name.
+    """
+    timestamp_condition = "update_timestamp > @timestamp" if timestamp else "TRUE"


Just to double check, did you get approval from @keyurva and the data team to make this change to the schema to support the timestamp on the Node Table?
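For context on the `timestamp_condition` line in the hunk above, here is a minimal sketch of how such a condition might be combined with a parameterized query. `build_node_query` is a hypothetical name, the table name `Node` and the non-empty name filter are assumptions based on the PR description, and only the `update_timestamp`, `subject_id`, and `name` identifiers come from the diff.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional, Tuple


def build_node_query(timestamp: Optional[datetime]) -> Tuple[str, Dict[str, Any]]:
    """Build the SQL text and query parameters for selecting nodes to embed.

    With no timestamp (first run, no lock recorded) the condition
    collapses to TRUE and every node with a non-empty name is selected.
    """
    timestamp_condition = "update_timestamp > @timestamp" if timestamp else "TRUE"
    sql = (
        "SELECT subject_id, name FROM Node "  # table name is an assumption
        f"WHERE {timestamp_condition} AND name IS NOT NULL AND name != ''"
    )
    # Only bind @timestamp when the condition actually references it.
    params = {"timestamp": timestamp} if timestamp else {}
    return sql, params


# Example: incremental run that picks up nodes updated after a given instant.
sql, params = build_node_query(datetime(2026, 4, 17, tzinfo=timezone.utc))
```

With the google-cloud-spanner client, the returned pair would typically be passed to `execute_sql` along with a matching `param_types` mapping; that wiring is left out here so the sketch runs without any dependency.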

shixiao-coder · 2026-04-22T14:55:45Z

> Thanks Xiao, this looks great. I left a couple of minor comments, plus one on the query and maybe how to optimize it.
>
> My main feedback is that we need tests to go in this PR too! Can you add a tests/ directory and add unit tests for embedding_utils and e2e tests for main.py?

Added @gmechali

shixiao-coder requested a review from gmechali April 17, 2026 16:45

Merge branch 'master' into v0-embedding-ingestion-core-logic

8df3d91

shixiao-coder requested a review from clincoln8 April 17, 2026 16:46

gemini-code-assist Bot reviewed Apr 17, 2026

shixiao-coder added 5 commits April 17, 2026 14:07

Update by comments

25a3406

Merge branch 'master' into v0-embedding-ingestion-core-logic

880fa0d

Merge branch 'master' into v0-embedding-ingestion-core-logic

b7e9ba9

Refine the logic to use timestamp to filter nodes

844b92e

gmechali reviewed Apr 20, 2026

Comment thread import-automation/workflow/embedding-helper/embedding_utils.py

Comment thread import-automation/workflow/embedding-helper/embedding_utils.py

shixiao-coder added 2 commits April 20, 2026 17:04

Updated to pass data by stream and related Docker to be deployed to c…

470c64b

…loud run Running with experiment deployed image and confirmed proper ingestion

Merge branch 'master' into v0-embedding-ingestion-core-logic

ca07810

shixiao-coder requested review from gmechali and vish-cs April 21, 2026 15:41

shixiao-coder added 2 commits April 21, 2026 16:04

Update the NodeEmbeddings table to contain the types. Types will be u…

2a4a745

…sed for filtering Nodes

Merge branch 'master' into v0-embedding-ingestion-core-logic

da71463

gmechali reviewed Apr 22, 2026

Add tests for all embedding util functions as well as E2E for main.

11465f2

Updating header message for code

V0 version of embedding ingestion core. #1964
shixiao-coder wants to merge 12 commits into datacommonsorg:master from shixiao-coder:v0-embedding-ingestion-core-logic


2 participants
